{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BojqmYOsPk0A"
      },
      "source": [
        "##### Copyright 2024 Google LLC."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "FtMmJ-pvPfNl"
      },
      "outputs": [],
      "source": [
        "# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IyATsAVlPz1W"
      },
      "source": [
        "# Interacting with Gemma 2 using SGLang\n",
        "\n",
        "[Gemma](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open-source language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs), available in English, with open weights, pre-trained variants, and instruction-tuned variants.\n",
        "Gemma models are well-suited for various text-generation tasks, including question-answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or your cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.\n",
        "\n",
        "[SGLang](https://github.com/sgl-project/sglang?tab=readme-ov-file) is a serving framework for Large Language models. It offers a fast backend runtime and a flexible front end language allowing you to control and customize model interactions.\n",
        "\n",
        "In this notebook, you will learn how to prompt Gemma 2 model in various ways using the **SGLang** http server, backend runtime and frontend language in a Google Colab environment.\n",
        "<table align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_2]Using_with_SGLang.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nFgaq_--Qg-O"
      },
      "source": [
        "## Setup\n",
        "\n",
        "### Select the Colab runtime\n",
        "To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:\n",
        "\n",
        "1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.\n",
        "2. Select **Change runtime type**.\n",
        "3. Under **Hardware accelerator**, select **T4 GPU**.\n",
        "\n",
        "### Setup Hugging Face\n",
        "\n",
        "**Before you dive into the tutorial, let's get you set up with Hugging face:**\n",
        "\n",
        "1. **Hugging Face Account:**  If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).\n",
        "\n",
        "2. **Hugging Face Token:**  Generate a Hugging Face access (preferably `write` permission) token by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.\n",
        "\n",
        "**Once you've completed these steps, you're ready to move on to the next section where you'll set up environment variables in your Colab environment.**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bLEUJYZ8QmGz"
      },
      "source": [
        "### Configure your HF token\n",
        "\n",
        "Add your Hugging Face token to the Colab Secrets manager to securely store it.\n",
        "\n",
        "1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src=\"https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg\" alt=\"The Secrets tab is found on the left panel.\" width=50%>\n",
        "2. Create a new secret with the name `HF_TOKEN`.\n",
        "3. Copy/paste your HF token key into the Value input box of `HF_TOKEN`.\n",
        "4. Toggle the button on the left to allow notebook access to the secret."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "hK-qUiQGQbe5"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from google.colab import userdata\n",
        "\n",
        "# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env\n",
        "# vars as appropriate for your system.\n",
        "os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5DrB-trDQsgE"
      },
      "source": [
        "### Install dependencies\n",
        "\n",
        "First, you must install the necessary packages for SGLang."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "LLjxxhk2Qrf_"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Collecting sglang[all]\n",
            "  Downloading sglang-0.3.5-py3-none-any.whl.metadata (21 kB)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from sglang[all]) (2.32.3)\n",
            "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from sglang[all]) (4.66.6)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from sglang[all]) (1.26.4)\n",
            "Requirement already satisfied: IPython in /usr/local/lib/python3.10/dist-packages (from sglang[all]) (7.34.0)\n",
            "Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.10/dist-packages (from IPython->sglang[all]) (75.1.0)\n",
            "Collecting jedi>=0.16 (from IPython->sglang[all])\n",
            "  Downloading jedi-0.19.1-py2.py3-none-any.whl.metadata (22 kB)\n",
            "Requirement already satisfied: decorator in /usr/local/lib/python3.10/dist-packages (from IPython->sglang[all]) (4.4.2)\n",
            "Requirement already satisfied: pickleshare in /usr/local/lib/python3.10/dist-packages (from IPython->sglang[all]) (0.7.5)\n",
            "Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.10/dist-packages (from IPython->sglang[all]) (5.7.1)\n",
            "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from IPython->sglang[all]) (3.0.48)\n",
            "Requirement already satisfied: pygments in /usr/local/lib/python3.10/dist-packages (from IPython->sglang[all]) (2.18.0)\n",
            "Requirement already satisfied: backcall in /usr/local/lib/python3.10/dist-packages (from IPython->sglang[all]) (0.2.0)\n",
            "Requirement already satisfied: matplotlib-inline in /usr/local/lib/python3.10/dist-packages (from IPython->sglang[all]) (0.1.7)\n",
            "Requirement already satisfied: pexpect>4.3 in /usr/local/lib/python3.10/dist-packages (from IPython->sglang[all]) (4.9.0)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->sglang[all]) (3.4.0)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->sglang[all]) (3.10)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->sglang[all]) (2.2.3)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->sglang[all]) (2024.8.30)\n",
            "Collecting anthropic>=0.20.0 (from sglang[all])\n",
            "  Downloading anthropic-0.39.0-py3-none-any.whl.metadata (22 kB)\n",
            "Collecting litellm>=1.0.0 (from sglang[all])\n",
            "  Downloading litellm-1.51.3-py3-none-any.whl.metadata (32 kB)\n",
            "Requirement already satisfied: openai>=1.0 in /usr/local/lib/python3.10/dist-packages (from sglang[all]) (1.52.2)\n",
            "Collecting tiktoken (from sglang[all])\n",
            "  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)\n",
            "Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from sglang[all]) (2.5.0+cu121)\n",
            "Collecting vllm==0.6.3.post1 (from sglang[all])\n",
            "  Downloading vllm-0.6.3.post1-cp38-abi3-manylinux1_x86_64.whl.metadata (10 kB)\n",
            "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (5.9.5)\n",
            "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (0.2.0)\n",
            "Requirement already satisfied: py-cpuinfo in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (9.0.0)\n",
            "Collecting transformers>=4.45.2 (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading transformers-4.46.1-py3-none-any.whl.metadata (44 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m44.1/44.1 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: tokenizers>=0.19.1 in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (0.19.1)\n",
            "Requirement already satisfied: protobuf in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (3.20.3)\n",
            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (3.10.10)\n",
            "Collecting uvicorn[standard] (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading uvicorn-0.32.0-py3-none-any.whl.metadata (6.6 kB)\n",
            "Requirement already satisfied: pydantic>=2.9 in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (2.9.2)\n",
            "Requirement already satisfied: pillow in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (10.4.0)\n",
            "Requirement already satisfied: prometheus-client>=0.18.0 in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (0.21.0)\n",
            "Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl.metadata (13 kB)\n",
            "Collecting lm-format-enforcer==0.10.6 (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading lm_format_enforcer-0.10.6-py3-none-any.whl.metadata (16 kB)\n",
            "Collecting outlines<0.1,>=0.0.43 (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading outlines-0.0.46-py3-none-any.whl.metadata (15 kB)\n",
            "Requirement already satisfied: typing-extensions>=4.10 in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (4.12.2)\n",
            "Requirement already satisfied: filelock>=3.10.4 in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (3.16.1)\n",
            "Collecting partial-json-parser (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading partial_json_parser-0.2.1.1.post4-py3-none-any.whl.metadata (6.2 kB)\n",
            "Requirement already satisfied: pyzmq in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (24.0.1)\n",
            "Collecting msgspec (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading msgspec-0.18.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)\n",
            "Collecting gguf==0.10.0 (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading gguf-0.10.0-py3-none-any.whl.metadata (3.5 kB)\n",
            "Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (8.5.0)\n",
            "Collecting mistral-common>=1.4.4 (from mistral-common[opencv]>=1.4.4->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading mistral_common-1.4.4-py3-none-any.whl.metadata (4.6 kB)\n",
            "Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (6.0.2)\n",
            "Requirement already satisfied: einops in /usr/local/lib/python3.10/dist-packages (from vllm==0.6.3.post1->sglang[all]) (0.8.0)\n",
            "Collecting compressed-tensors==0.6.0 (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading compressed_tensors-0.6.0-py3-none-any.whl.metadata (6.8 kB)\n",
            "Collecting ray>=2.9 (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading ray-2.38.0-cp310-cp310-manylinux2014_x86_64.whl.metadata (17 kB)\n",
            "Collecting nvidia-ml-py (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading nvidia_ml_py-12.560.30-py3-none-any.whl.metadata (8.6 kB)\n",
            "Collecting torch (from sglang[all])\n",
            "  Downloading torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)\n",
            "Collecting torchvision==0.19 (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading torchvision-0.19.0-cp310-cp310-manylinux1_x86_64.whl.metadata (6.0 kB)\n",
            "Collecting xformers==0.0.27.post2 (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading xformers-0.0.27.post2-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)\n",
            "Collecting fastapi!=0.113.*,!=0.114.0,>=0.107.0 (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading fastapi-0.115.4-py3-none-any.whl.metadata (27 kB)\n",
            "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch->sglang[all]) (1.13.1)\n",
            "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->sglang[all]) (3.4.2)\n",
            "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch->sglang[all]) (3.1.4)\n",
            "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch->sglang[all]) (2024.10.0)\n",
            "Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->sglang[all])\n",
            "  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n",
            "Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->sglang[all])\n",
            "  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n",
            "Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->sglang[all])\n",
            "  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)\n",
            "Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch->sglang[all])\n",
            "  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)\n",
            "Collecting nvidia-cublas-cu12==12.1.3.1 (from torch->sglang[all])\n",
            "  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n",
            "Collecting nvidia-cufft-cu12==11.0.2.54 (from torch->sglang[all])\n",
            "  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n",
            "Collecting nvidia-curand-cu12==10.3.2.106 (from torch->sglang[all])\n",
            "  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n",
            "Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch->sglang[all])\n",
            "  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)\n",
            "Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch->sglang[all])\n",
            "  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)\n",
            "Collecting nvidia-nccl-cu12==2.20.5 (from torch->sglang[all])\n",
            "  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)\n",
            "Collecting nvidia-nvtx-cu12==12.1.105 (from torch->sglang[all])\n",
            "  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.7 kB)\n",
            "Collecting triton==3.0.0 (from torch->sglang[all])\n",
            "  Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.3 kB)\n",
            "Collecting interegular>=0.3.2 (from lm-format-enforcer==0.10.6->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading interegular-0.3.3-py37-none-any.whl.metadata (3.0 kB)\n",
            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from lm-format-enforcer==0.10.6->vllm==0.6.3.post1->sglang[all]) (24.1)\n",
            "Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->sglang[all]) (12.6.77)\n",
            "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from anthropic>=0.20.0->sglang[all]) (3.7.1)\n",
            "Requirement already satisfied: distro<2,>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from anthropic>=0.20.0->sglang[all]) (1.9.0)\n",
            "Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from anthropic>=0.20.0->sglang[all]) (0.27.2)\n",
            "Requirement already satisfied: jiter<1,>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from anthropic>=0.20.0->sglang[all]) (0.6.1)\n",
            "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from anthropic>=0.20.0->sglang[all]) (1.3.1)\n",
            "Requirement already satisfied: parso<0.9.0,>=0.8.3 in /usr/local/lib/python3.10/dist-packages (from jedi>=0.16->IPython->sglang[all]) (0.8.4)\n",
            "Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from litellm>=1.0.0->sglang[all]) (8.1.7)\n",
            "Requirement already satisfied: jsonschema<5.0.0,>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from litellm>=1.0.0->sglang[all]) (4.23.0)\n",
            "Collecting python-dotenv>=0.2.0 (from litellm>=1.0.0->sglang[all])\n",
            "  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)\n",
            "Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.10/dist-packages (from pexpect>4.3->IPython->sglang[all]) (0.7.0)\n",
            "Requirement already satisfied: wcwidth in /usr/local/lib/python3.10/dist-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->IPython->sglang[all]) (0.2.13)\n",
            "Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.10/dist-packages (from tiktoken->sglang[all]) (2024.9.11)\n",
            "Collecting decord (from sglang[all])\n",
            "  Downloading decord-0.6.0-py3-none-manylinux2010_x86_64.whl.metadata (422 bytes)\n",
            "Collecting hf-transfer (from sglang[all])\n",
            "  Downloading hf_transfer-0.1.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.7 kB)\n",
            "Requirement already satisfied: huggingface-hub in /usr/local/lib/python3.10/dist-packages (from sglang[all]) (0.24.7)\n",
            "Requirement already satisfied: orjson in /usr/local/lib/python3.10/dist-packages (from sglang[all]) (3.10.10)\n",
            "Collecting python-multipart (from sglang[all])\n",
            "  Downloading python_multipart-0.0.17-py3-none-any.whl.metadata (1.8 kB)\n",
            "Collecting torchao (from sglang[all])\n",
            "  Downloading torchao-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)\n",
            "Collecting uvloop (from sglang[all])\n",
            "  Downloading uvloop-0.21.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)\n",
            "Collecting zmq (from sglang[all])\n",
            "  Downloading zmq-0.0.0.zip (2.2 kB)\n",
            "  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "Collecting modelscope (from sglang[all])\n",
            "  Downloading modelscope-1.19.2-py3-none-any.whl.metadata (40 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.7/40.7 kB\u001b[0m \u001b[31m3.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->anthropic>=0.20.0->sglang[all]) (1.2.2)\n",
            "Collecting starlette<0.42.0,>=0.40.0 (from fastapi!=0.113.*,!=0.114.0,>=0.107.0->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading starlette-0.41.2-py3-none-any.whl.metadata (6.0 kB)\n",
            "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->anthropic>=0.20.0->sglang[all]) (1.0.6)\n",
            "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx<1,>=0.23.0->anthropic>=0.20.0->sglang[all]) (0.14.0)\n",
            "Requirement already satisfied: zipp>=3.20 in /usr/local/lib/python3.10/dist-packages (from importlib-metadata->vllm==0.6.3.post1->sglang[all]) (3.20.2)\n",
            "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch->sglang[all]) (3.0.2)\n",
            "Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema<5.0.0,>=4.22.0->litellm>=1.0.0->sglang[all]) (24.2.0)\n",
            "Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema<5.0.0,>=4.22.0->litellm>=1.0.0->sglang[all]) (2024.10.1)\n",
            "Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema<5.0.0,>=4.22.0->litellm>=1.0.0->sglang[all]) (0.35.1)\n",
            "Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema<5.0.0,>=4.22.0->litellm>=1.0.0->sglang[all]) (0.20.0)\n",
            "Collecting tiktoken (from sglang[all])\n",
            "  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)\n",
            "Requirement already satisfied: opencv-python-headless<5.0.0,>=4.0.0 in /usr/local/lib/python3.10/dist-packages (from mistral-common[opencv]>=1.4.4->vllm==0.6.3.post1->sglang[all]) (4.10.0.84)\n",
            "Collecting lark (from outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading lark-1.2.2-py3-none-any.whl.metadata (1.8 kB)\n",
            "Requirement already satisfied: nest-asyncio in /usr/local/lib/python3.10/dist-packages (from outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all]) (1.6.0)\n",
            "Requirement already satisfied: cloudpickle in /usr/local/lib/python3.10/dist-packages (from outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all]) (3.1.0)\n",
            "Collecting diskcache (from outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)\n",
            "Requirement already satisfied: numba in /usr/local/lib/python3.10/dist-packages (from outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all]) (0.60.0)\n",
            "Collecting datasets (from outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)\n",
            "Collecting pycountry (from outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)\n",
            "Collecting pyairports (from outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading pyairports-2.1.1-py3-none-any.whl.metadata (1.7 kB)\n",
            "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.9->vllm==0.6.3.post1->sglang[all]) (0.7.0)\n",
            "Requirement already satisfied: pydantic-core==2.23.4 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.9->vllm==0.6.3.post1->sglang[all]) (2.23.4)\n",
            "Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm==0.6.3.post1->sglang[all]) (1.1.0)\n",
            "Requirement already satisfied: aiosignal in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm==0.6.3.post1->sglang[all]) (1.3.1)\n",
            "Requirement already satisfied: frozenlist in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm==0.6.3.post1->sglang[all]) (1.5.0)\n",
            "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.45.2->vllm==0.6.3.post1->sglang[all]) (0.4.5)\n",
            "Collecting tokenizers>=0.19.1 (from vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading tokenizers-0.20.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)\n",
            "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->vllm==0.6.3.post1->sglang[all]) (2.4.3)\n",
            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->vllm==0.6.3.post1->sglang[all]) (6.1.0)\n",
            "Requirement already satisfied: yarl<2.0,>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->vllm==0.6.3.post1->sglang[all]) (1.17.0)\n",
            "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->vllm==0.6.3.post1->sglang[all]) (4.0.3)\n",
            "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch->sglang[all]) (1.3.0)\n",
            "Collecting httptools>=0.5.0 (from uvicorn[standard]->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading httptools-0.6.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)\n",
            "Collecting watchfiles>=0.13 (from uvicorn[standard]->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading watchfiles-0.24.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)\n",
            "Collecting websockets>=10.4 (from uvicorn[standard]->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading websockets-13.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)\n",
            "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from yarl<2.0,>=1.12.0->aiohttp->vllm==0.6.3.post1->sglang[all]) (0.2.0)\n",
            "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets->outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all]) (17.0.0)\n",
            "Collecting dill<0.3.9,>=0.3.0 (from datasets->outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)\n",
            "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets->outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all]) (2.2.2)\n",
            "Collecting xxhash (from datasets->outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)\n",
            "Collecting multiprocess<0.70.17 (from datasets->outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all])\n",
            "  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)\n",
            "Collecting fsspec (from torch->sglang[all])\n",
            "  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)\n",
            "Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba->outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all]) (0.43.0)\n",
            "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all]) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all]) (2024.2)\n",
            "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all]) (2024.2)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets->outlines<0.1,>=0.0.43->vllm==0.6.3.post1->sglang[all]) (1.16.0)\n",
            "Downloading sglang-0.3.5-py3-none-any.whl (436 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m436.8/436.8 kB\u001b[0m \u001b[31m26.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading vllm-0.6.3.post1-cp38-abi3-manylinux1_x86_64.whl (194.8 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.8/194.8 MB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m797.2/797.2 MB\u001b[0m \u001b[31m2.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading compressed_tensors-0.6.0-py3-none-any.whl (92 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m92.0/92.0 kB\u001b[0m \u001b[31m8.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading gguf-0.10.0-py3-none-any.whl (71 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m71.6/71.6 kB\u001b[0m \u001b[31m6.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading lm_format_enforcer-0.10.6-py3-none-any.whl (43 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m43.7/43.7 kB\u001b[0m \u001b[31m3.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m410.6/410.6 MB\u001b[0m \u001b[31m4.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.1/14.1 MB\u001b[0m \u001b[31m41.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m23.7/23.7 MB\u001b[0m \u001b[31m64.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m823.6/823.6 kB\u001b[0m \u001b[31m53.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m664.8/664.8 MB\u001b[0m \u001b[31m2.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m121.6/121.6 MB\u001b[0m \u001b[31m7.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.5/56.5 MB\u001b[0m \u001b[31m11.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m124.2/124.2 MB\u001b[0m \u001b[31m7.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m196.0/196.0 MB\u001b[0m \u001b[31m5.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m176.2/176.2 MB\u001b[0m \u001b[31m6.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m99.1/99.1 kB\u001b[0m \u001b[31m8.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading torchvision-0.19.0-cp310-cp310-manylinux1_x86_64.whl (7.0 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.0/7.0 MB\u001b[0m \u001b[31m109.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m209.4/209.4 MB\u001b[0m \u001b[31m1.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading xformers-0.0.27.post2-cp310-cp310-manylinux2014_x86_64.whl (20.8 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m20.8/20.8 MB\u001b[0m \u001b[31m90.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading anthropic-0.39.0-py3-none-any.whl (198 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m198.4/198.4 kB\u001b[0m \u001b[31m16.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m69.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading litellm-1.51.3-py3-none-any.whl (6.3 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.3/6.3 MB\u001b[0m \u001b[31m92.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading fastapi-0.115.4-py3-none-any.whl (94 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m94.7/94.7 kB\u001b[0m \u001b[31m9.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading interegular-0.3.3-py37-none-any.whl (23 kB)\n",
            "Downloading mistral_common-1.4.4-py3-none-any.whl (6.0 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.0/6.0 MB\u001b[0m \u001b[31m103.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m60.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading outlines-0.0.46-py3-none-any.whl (101 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m101.9/101.9 kB\u001b[0m \u001b[31m9.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl (19 kB)\n",
            "Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)\n",
            "Downloading ray-2.38.0-cp310-cp310-manylinux2014_x86_64.whl (66.0 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m66.0/66.0 MB\u001b[0m \u001b[31m9.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading transformers-4.46.1-py3-none-any.whl (10.0 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m10.0/10.0 MB\u001b[0m \u001b[31m84.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading tokenizers-0.20.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.0/3.0 MB\u001b[0m \u001b[31m74.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading decord-0.6.0-py3-none-manylinux2010_x86_64.whl (13.6 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.6/13.6 MB\u001b[0m \u001b[31m68.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading hf_transfer-0.1.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.6/3.6 MB\u001b[0m \u001b[31m35.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading modelscope-1.19.2-py3-none-any.whl (5.8 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.8/5.8 MB\u001b[0m \u001b[31m63.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading msgspec-0.18.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (210 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m210.3/210.3 kB\u001b[0m \u001b[31m16.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading nvidia_ml_py-12.560.30-py3-none-any.whl (40 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.5/40.5 kB\u001b[0m \u001b[31m3.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading partial_json_parser-0.2.1.1.post4-py3-none-any.whl (9.9 kB)\n",
            "Downloading python_multipart-0.0.17-py3-none-any.whl (24 kB)\n",
            "Downloading torchao-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.2/2.2 MB\u001b[0m \u001b[31m67.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading uvicorn-0.32.0-py3-none-any.whl (63 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m63.7/63.7 kB\u001b[0m \u001b[31m6.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading uvloop-0.21.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.8/3.8 MB\u001b[0m \u001b[31m69.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading httptools-0.6.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (442 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m442.1/442.1 kB\u001b[0m \u001b[31m26.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading starlette-0.41.2-py3-none-any.whl (73 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.3/73.3 kB\u001b[0m \u001b[31m7.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading watchfiles-0.24.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (425 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m425.7/425.7 kB\u001b[0m \u001b[31m30.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading websockets-13.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (164 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m164.1/164.1 kB\u001b[0m \u001b[31m15.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading datasets-3.1.0-py3-none-any.whl (480 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m38.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (179 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m179.3/179.3 kB\u001b[0m \u001b[31m16.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading diskcache-5.6.3-py3-none-any.whl (45 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m45.5/45.5 kB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading lark-1.2.2-py3-none-any.whl (111 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m111.0/111.0 kB\u001b[0m \u001b[31m10.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading pyairports-2.1.1-py3-none-any.whl (371 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m371.7/371.7 kB\u001b[0m \u001b[31m27.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.3/6.3 MB\u001b[0m \u001b[31m97.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m11.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading multiprocess-0.70.16-py310-none-any.whl (134 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m14.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m18.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hBuilding wheels for collected packages: zmq\n",
            "  Building wheel for zmq (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for zmq: filename=zmq-0.0.0-py3-none-any.whl size=1265 sha256=72c1d4d8588518e3127e52eb54d1bbbe8c082845ed37bd5d5c8edb19f807a508\n",
            "  Stored in directory: /root/.cache/pip/wheels/ab/c5/fe/d853f71843cae26c123d37a7a5934baac20fc66f35a913951d\n",
            "Successfully built zmq\n",
            "Installing collected packages: torchao, pyairports, nvidia-ml-py, zmq, xxhash, websockets, uvloop, uvicorn, triton, python-multipart, python-dotenv, pycountry, partial-json-parser, nvidia-nvtx-cu12, nvidia-nccl-cu12, nvidia-cusparse-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, msgspec, lark, jedi, interegular, httptools, hf-transfer, gguf, fsspec, diskcache, dill, decord, watchfiles, tiktoken, starlette, nvidia-cusolver-cu12, nvidia-cudnn-cu12, multiprocess, modelscope, torch, tokenizers, sglang, prometheus-fastapi-instrumentator, lm-format-enforcer, fastapi, anthropic, xformers, transformers, torchvision, ray, mistral-common, litellm, datasets, compressed-tensors, outlines, vllm\n",
            "  Attempting uninstall: nvidia-nccl-cu12\n",
            "    Found existing installation: nvidia-nccl-cu12 2.23.4\n",
            "    Uninstalling nvidia-nccl-cu12-2.23.4:\n",
            "      Successfully uninstalled nvidia-nccl-cu12-2.23.4\n",
            "  Attempting uninstall: nvidia-cusparse-cu12\n",
            "    Found existing installation: nvidia-cusparse-cu12 12.5.4.2\n",
            "    Uninstalling nvidia-cusparse-cu12-12.5.4.2:\n",
            "      Successfully uninstalled nvidia-cusparse-cu12-12.5.4.2\n",
            "  Attempting uninstall: nvidia-curand-cu12\n",
            "    Found existing installation: nvidia-curand-cu12 10.3.7.77\n",
            "    Uninstalling nvidia-curand-cu12-10.3.7.77:\n",
            "      Successfully uninstalled nvidia-curand-cu12-10.3.7.77\n",
            "  Attempting uninstall: nvidia-cufft-cu12\n",
            "    Found existing installation: nvidia-cufft-cu12 11.3.0.4\n",
            "    Uninstalling nvidia-cufft-cu12-11.3.0.4:\n",
            "      Successfully uninstalled nvidia-cufft-cu12-11.3.0.4\n",
            "  Attempting uninstall: nvidia-cuda-runtime-cu12\n",
            "    Found existing installation: nvidia-cuda-runtime-cu12 12.6.77\n",
            "    Uninstalling nvidia-cuda-runtime-cu12-12.6.77:\n",
            "      Successfully uninstalled nvidia-cuda-runtime-cu12-12.6.77\n",
            "  Attempting uninstall: nvidia-cuda-cupti-cu12\n",
            "    Found existing installation: nvidia-cuda-cupti-cu12 12.6.80\n",
            "    Uninstalling nvidia-cuda-cupti-cu12-12.6.80:\n",
            "      Successfully uninstalled nvidia-cuda-cupti-cu12-12.6.80\n",
            "  Attempting uninstall: nvidia-cublas-cu12\n",
            "    Found existing installation: nvidia-cublas-cu12 12.6.3.3\n",
            "    Uninstalling nvidia-cublas-cu12-12.6.3.3:\n",
            "      Successfully uninstalled nvidia-cublas-cu12-12.6.3.3\n",
            "  Attempting uninstall: fsspec\n",
            "    Found existing installation: fsspec 2024.10.0\n",
            "    Uninstalling fsspec-2024.10.0:\n",
            "      Successfully uninstalled fsspec-2024.10.0\n",
            "  Attempting uninstall: nvidia-cusolver-cu12\n",
            "    Found existing installation: nvidia-cusolver-cu12 11.7.1.2\n",
            "    Uninstalling nvidia-cusolver-cu12-11.7.1.2:\n",
            "      Successfully uninstalled nvidia-cusolver-cu12-11.7.1.2\n",
            "  Attempting uninstall: nvidia-cudnn-cu12\n",
            "    Found existing installation: nvidia-cudnn-cu12 9.5.1.17\n",
            "    Uninstalling nvidia-cudnn-cu12-9.5.1.17:\n",
            "      Successfully uninstalled nvidia-cudnn-cu12-9.5.1.17\n",
            "  Attempting uninstall: torch\n",
            "    Found existing installation: torch 2.5.0+cu121\n",
            "    Uninstalling torch-2.5.0+cu121:\n",
            "      Successfully uninstalled torch-2.5.0+cu121\n",
            "  Attempting uninstall: tokenizers\n",
            "    Found existing installation: tokenizers 0.19.1\n",
            "    Uninstalling tokenizers-0.19.1:\n",
            "      Successfully uninstalled tokenizers-0.19.1\n",
            "  Attempting uninstall: transformers\n",
            "    Found existing installation: transformers 4.44.2\n",
            "    Uninstalling transformers-4.44.2:\n",
            "      Successfully uninstalled transformers-4.44.2\n",
            "  Attempting uninstall: torchvision\n",
            "    Found existing installation: torchvision 0.20.0+cu121\n",
            "    Uninstalling torchvision-0.20.0+cu121:\n",
            "      Successfully uninstalled torchvision-0.20.0+cu121\n",
            "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
            "gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.\n",
            "torchaudio 2.5.0+cu121 requires torch==2.5.0, but you have torch 2.4.0 which is incompatible.\u001b[0m\u001b[31m\n",
            "\u001b[0mSuccessfully installed anthropic-0.39.0 compressed-tensors-0.6.0 datasets-3.1.0 decord-0.6.0 dill-0.3.8 diskcache-5.6.3 fastapi-0.115.4 fsspec-2024.9.0 gguf-0.10.0 hf-transfer-0.1.8 httptools-0.6.4 interegular-0.3.3 jedi-0.19.1 lark-1.2.2 litellm-1.51.3 lm-format-enforcer-0.10.6 mistral-common-1.4.4 modelscope-1.19.2 msgspec-0.18.6 multiprocess-0.70.16 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-ml-py-12.560.30 nvidia-nccl-cu12-2.20.5 nvidia-nvtx-cu12-12.1.105 outlines-0.0.46 partial-json-parser-0.2.1.1.post4 prometheus-fastapi-instrumentator-7.0.0 pyairports-2.1.1 pycountry-24.6.1 python-dotenv-1.0.1 python-multipart-0.0.17 ray-2.38.0 sglang-0.3.5 starlette-0.41.2 tiktoken-0.7.0 tokenizers-0.20.2 torch-2.4.0 torchao-0.6.1 torchvision-0.19.0 transformers-4.46.1 triton-3.0.0 uvicorn-0.32.0 uvloop-0.21.0 vllm-0.6.3.post1 watchfiles-0.24.0 websockets-13.1 xformers-0.0.27.post2 xxhash-3.5.0 zmq-0.0.0\n",
            "Looking in indexes: https://flashinfer.ai/whl/cu121/torch2.4/\n",
            "Collecting flashinfer\n",
            "  Downloading https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6%2Bcu121torch2.4-cp310-cp310-linux_x86_64.whl (1322.8 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 GB\u001b[0m \u001b[31m416.0 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hInstalling collected packages: flashinfer\n",
            "Successfully installed flashinfer-0.1.6+cu121torch2.4\n"
          ]
        }
      ],
      "source": [
        "!pip install \"sglang[all]\"\n",
        "\n",
        "# Install FlashInfer accelerated kernels\n",
        "!pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_wsT7Eb_1esZ"
      },
      "source": [
        "## Overview\n",
        "\n",
        "SGLang offers a fast backend runtime and flexible frontend language. To showcase the different ways in which Gemma 2 can be prompted using SGLang, this notebook is divided into the following sections:\n",
        "1. Launch a HTTP server using SGLang. Use Python `requests` to prompt Gemma using SGLang's native genration APIs.\n",
        "2. Set up a SGLang backend inference engine to prompt Gemma without a HTTP server.\n",
        "3. Use SGLang's frontend generation language to prompt Gemma and also explore a few of its capabilities."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tMfQBPRYsknZ"
      },
      "source": [
        "## 1. Sending requests to SGLang server running Gemma 2\n",
        "\n",
        "In this section, you will launch an HTTP server to run Gemma 2 using SGLang and send a prompt to the model using the native generation API endpoint."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yVkCatK-twQH"
      },
      "source": [
        "### Launch a server\n",
        "\n",
        "The SGLang server can be launched by running the following command in the terminal:\n",
        "\n",
        "`python -m sglang.launch_server --model-path google/gemma-2-2b-it --port YOUR_PREFERRED_PORT`\n",
        "\n",
        "In a Colab environment, you must run the SGLang server as a Python subprocess and manage its termination using Python's `subprocess` package. SGLang provides some utility methods that abstract these details for you. The `execute_shell_command` function lets you launch the server as a Python subprocess, while the `wait_for_server` function waits for the server to be up and running before you can send requests to it.\n",
        "\n",
        "You can specify Gemma 2's Hugging Face repo ID directly for the `--model-path` argument. SGLang will download the necessary files from the Hugging Face repository to start the server.\n",
        "\n",
        "You can set any port of your choice to run SGLang using the `--port` argument.\n",
        "\n",
        "Throughout this notebook, `--mem-fraction-static` is set to 0.6 to avoid CUDA Out of Memory errors when running on the Colab free tier. Setting the `--mem-fraction-static` argument to a lower value reduces the memory usage of the KV cache memory pool. Feel free to experiment with different values according to your use case.\n",
        "\n",
        "**Note**: The following code snippet defines a function that executes the shell command and waits for the server to be ready. This function will be reused later in this notebook."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "KK8ltIwzzmNP"
      },
      "outputs": [],
      "source": [
        "from sglang.utils import (\n",
        "    execute_shell_command,\n",
        "    wait_for_server,\n",
        "    terminate_process,\n",
        ")\n",
        "\n",
        "def start_server():\n",
        "\n",
        "  process = execute_shell_command(\n",
        "  \"\"\"\n",
        "  python -m sglang.launch_server --model-path google/gemma-2-2b-it \\\n",
        "  --mem-fraction-static 0.6 \\\n",
        "  --port 9000\n",
        "  \"\"\"\n",
        "  )\n",
        "\n",
        "  wait_for_server(\"http://localhost:9000\")\n",
        "\n",
        "  return process"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EjEmOL3cwo8C"
      },
      "source": [
        "Invoke the previously defined `start_server` function to start the server and obtain a reference to the server process.\n",
        "\n",
        "**Note**: It takes 2 - 4 minutes for the server to be up and running."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "a3y4oCDGCNCM"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<strong style='color: #00008B;'><br><br>                    NOTE: Typically, the server runs in a separate terminal.<br>                    In this notebook, we run the server and notebook code together, so their outputs are combined.<br>                    To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.<br>                    </strong>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "server_process = start_server()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5sQe_RMaxsN2"
      },
      "source": [
        "The server is now ready and can be reached at http://localhost:9000/ from within this notebook."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9d7queczwzKk"
      },
      "source": [
        "### Send a request to Gemma 2 using SGLang's Native Generation API\n",
        "\n",
        "The following code snippet uses Python's `requests` library to invoke SGLang's native generation API on the server to send a prompt to Gemma-2.\n",
        "You can specify your preferred values for the sampling parameters like `temperature`, `top_p` `max_new_tokens` etc.\n",
        "\n",
        "For a full list of sampling parameters supported by SGLang, please refer to SGLang's [Sampling Parameters in SGLang Runtime](https://sgl-project.github.io/references/sampling_params.html) guide.\n",
        "\n",
        "\n",
        "To generate a streaming response from the model, specify an additional key, `stream` set to `True` in the request json and set the `stream` parameter of `requests.post` to `True`.\n",
        "\n",
        "An example of a streaming generation is provided in SGLang's [Quick Start](https://sgl-project.github.io/start/send_request.html#Streaming) documentation.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "0OjdqIfJ2ibe"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{\n",
            "  \"text\": \" \\n\\nI'm confused. \\n\\nIs it billions of years old?\\n\\nPlease explain. \\n\\n\\nYou're right to be confused! It's a big number. Here's a breakdown:\\n\\n**Earth is about 4.54 \\u00b1 0.05 billion years old.**\\n\\n* **Billions:** This means it's older than you and me, for sure! \\n* **4.54 billion:**  This is the most precise estimate we have. \\n* **\\u00b1 0.05:** This means there's a range of 0.0\",\n",
            "  \"meta_info\": {\n",
            "    \"prompt_tokens\": 8,\n",
            "    \"completion_tokens\": 128,\n",
            "    \"completion_tokens_wo_jump_forward\": 128,\n",
            "    \"cached_tokens\": 1,\n",
            "    \"finish_reason\": {\n",
            "      \"type\": \"length\",\n",
            "      \"length\": 128\n",
            "    },\n",
            "    \"id\": \"1485c86977304adf92b2db1f77054a07\"\n",
            "  }\n",
            "}\n"
          ]
        }
      ],
      "source": [
        "import requests\n",
        "import json\n",
        "\n",
        "response = requests.post(\n",
        "    \"http://localhost:9000/generate\",\n",
        "    json={\n",
        "        \"text\": \"What is the age of earth?.\",\n",
        "        \"sampling_params\": {\n",
        "            \"temperature\": 0.8,\n",
        "        },\n",
        "    },\n",
        ")\n",
        "print(json.dumps(response.json(), indent=2))"
      ]
    },
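    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a minimal sketch of the streaming option described above, the following cell sets `stream` to `True` both in the request JSON and in `requests.post`, then prints the text as it arrives. It assumes that the server sends server-sent-event lines prefixed with `data:`, that each event carries the cumulative text generated so far, and that the stream ends with `data: [DONE]`, as in SGLang's Quick Start example. Verify these assumptions against the SGLang version you have installed. The prompt and the `max_new_tokens` value are illustrative only."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "import requests\n",
        "\n",
        "response = requests.post(\n",
        "    \"http://localhost:9000/generate\",\n",
        "    json={\n",
        "        \"text\": \"List three facts about the Moon.\",  # Illustrative prompt\n",
        "        \"sampling_params\": {\"temperature\": 0.8, \"max_new_tokens\": 64},\n",
        "        \"stream\": True,\n",
        "    },\n",
        "    stream=True,\n",
        ")\n",
        "\n",
        "printed = 0  # Number of characters already printed\n",
        "for raw_line in response.iter_lines(decode_unicode=False):\n",
        "    line = raw_line.decode(\"utf-8\")\n",
        "    # Assumption: events look like 'data: {...}' and end with 'data: [DONE]'.\n",
        "    if not line.startswith(\"data:\"):\n",
        "        continue\n",
        "    payload = line[len(\"data:\"):].strip()\n",
        "    if payload == \"[DONE]\":\n",
        "        break\n",
        "    text = json.loads(payload)[\"text\"]  # Assumed to be cumulative\n",
        "    print(text[printed:], end=\"\", flush=True)\n",
        "    printed = len(text)\n",
        "print()"
      ]
    },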
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hrbOBp7K1Jl5"
      },
      "source": [
        "You can stop the server by using the `terminate_process` function from `sglang.utils`. This is equivalent to pressing Ctrl+C to stop the server from the terminal.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "id": "xHbafDwk25FX"
      },
      "outputs": [],
      "source": [
        "terminate_process(server_process)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eADwG1GAgqtb"
      },
      "source": [
        "## 2. Offline batch inference using SGLang backend engine"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Twjw3Ljfgxu9"
      },
      "source": [
        "SGLang provides an inference engine that allows you to directly interact with local models like Gemma 2 without requiring an HTTP server. You can use this for building custom servers or for offline batch inference.\n",
        "\n",
        "In this section, you will initialize the inference engine to run Gemma 2 and send a batch of prompts to it.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZDWz_5NG6SmD"
      },
      "source": [
        "### Initialize SGLang inference engine with Gemma 2\n",
        "\n",
        "Create an instance of `sglang.Engine` class to run Gemma 2 by specifying its Hugging repo ID for the`model_path` argument."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "3YpWIi9NZQ-g"
      },
      "outputs": [],
      "source": [
        "from sglang import Engine\n",
        "\n",
        "llm = Engine(model_path=\"google/gemma-2-2b-it\", mem_fraction_static=0.6)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MBGjQlws7kvb"
      },
      "source": [
        "### Batch prompting Gemma 2 using SGLang inference engine\n",
        "\n",
        "You can send a batch of prompts for inference to the SGLang engine in one of the following ways:\n",
        "\n",
        "1. Non-streaming synchronous call\n",
        "2. Streaming synchronous call\n",
        "3. Non-streaming asynchronous call\n",
        "4. Streaming asynchronous call\n",
        "\n",
        "You will explore how to perform inference on a batch of prompts using the SGLang engine's synchronous generation function to generate both streaming and non-streaming responses from Gemma 2 in the following sections.\n",
        "\n",
        "You can refer to SGLang's [Offline Engine API](https://sgl-project.github.io/backend/offline_engine_api.html) guide for examples of asynchronous response generation."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9pbtLg3s7NKT"
      },
      "source": [
        "### Non-streaming synchronous prompting\n",
        "\n",
        "Define a list of prompts to query Gemma 2 with."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "id": "xDfPcOnRKeBd"
      },
      "outputs": [],
      "source": [
        "prompts = [\n",
        "    \"Summarize what a galaxy is in three to four lines.\",\n",
        "    \"List any 3 observatories in the world.\",\n",
        "]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "t-WOeetf8_zt"
      },
      "source": [
        "Generate a batch of non-streaming responses from Gemma 2 using the inference engine's `generate` function. Pass the list of prompts you defined earlier and an optional dictionary of sampling parameters to this function. The function returns a list of complete responses from the model to the batch of prompts.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "id": "iErTZQ1Gfxyu"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "=================================================================\n",
            "\n",
            "Prompt: Summarize what a galaxy is in three to four lines.\n",
            "\n",
            "Generated text: \n",
            "\n",
            "A galaxy is a vast collection of stars, gas, dust, and dark matter held together by gravity. It is a massive, gravitationally bound system that can range in size from a few hundred thousand to billions of stars. Galaxies come in various shapes and sizes, from spiral galaxies like our Milky Way, to elliptical galaxies, and irregular galaxies. \n",
            "\n",
            "\n",
            "=================================================================\n",
            "\n",
            "Prompt: List any 3 observatories in the world.\n",
            "\n",
            "Generated text: \n",
            "\n",
            "Here are 3 observatories in the world:\n",
            "\n",
            "1. **Keck Observatory:** Located on Mauna Kea in Hawaii, the Keck Observatory is home to two of the world's largest optical/infrared telescopes.\n",
            "2. **Very Large Telescope (VLT):** Located in the Atacama Desert of Chile, the VLT is a collection of four telescopes that work together to provide high-resolution images of distant objects.\n",
            "3. **James Webb Space Telescope (JWST):** Launched in December 2021, the JWST is the largest and most powerful space telescope ever built, designed to study\n",
            "\n"
          ]
        }
      ],
      "source": [
        "sampling_params = {\"temperature\": 0.1}\n",
        "outputs = llm.generate(prompts, sampling_params)\n",
        "\n",
        "for prompt, output in zip(prompts, outputs):\n",
        "    print(\"=================================================================\\n\")\n",
        "    print(f\"Prompt: {prompt}\\n\\nGenerated text: {output['text']}\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3nacKFGphNRS"
      },
      "source": [
        "### Streaming synchronous prompting\n",
        "\n",
        "To generate streaming responses from the model to the previously defined batch of prompts, iterate over the `prompts` and invoke the inference engine's `generate` function with an additional argument `stream` set to `True`. You can access each chunk in the streaming response by iterating over the response of the `generate` function."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "id": "XcEw-OLIhMTT"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "===============================================================\n",
            "\n",
            "\n",
            "Prompt: Summarize what a galaxy is in three to four lines.\n",
            "\n",
            "Generated text: \n",
            "\n",
            "\n",
            "A galaxy is a massive collection of stars, gas, dust, and dark matter held together by gravity. These vast structures range in size from a few tens of thousands to billions of light-years across. Galaxies are the building blocks of the universe, containing billions of stars and countless planets. They come in various shapes and sizes, from spiral galaxies like our own Milky Way to elliptical galaxies and irregular galaxies. \n",
            "\n",
            "===============================================================\n",
            "\n",
            "\n",
            "Prompt: List any 3 observatories in the world.\n",
            "\n",
            "Generated text: \n",
            "\n",
            "\n",
            "Here are 3 observatories in the world:\n",
            "\n",
            "1. **Keck Observatory:** Located on Mauna Kea in Hawaii, the Keck Observatory is one of the world's most powerful optical/infrared telescopes.\n",
            "2. **Very Large Telescope (VLT):** Located in the Atacama Desert of Chile, the VLT is a collection of four telescopes that work together to provide high-resolution images of distant objects.\n",
            "3. **European Southern Observatory (ESO) Very Large Telescope (VLT):** Located in the Atacama Desert of Chile, the VLT is a collection of four telescopes that work together to"
          ]
        }
      ],
      "source": [
        "for prompt in prompts:\n",
        "    print(\"\\n===============================================================\\n\")\n",
        "    print(f\"\\nPrompt: {prompt}\\n\")\n",
        "    print(\"Generated text: \\n\", end=\"\", flush=True)\n",
        "\n",
        "    for chunk in llm.generate(prompt, sampling_params, stream=True):\n",
        "        print(chunk[\"text\"], end=\"\", flush=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1WDnWWAsB4AG"
      },
      "source": [
        "Now you can shut down and clean up the SGLang inference engine."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "id": "cVxEAEnaf8J7"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "W1105 16:28:30.518000 131977277298240 torch/_inductor/compile_worker/subproc_pool.py:126] SubprocPool unclean exit\n"
          ]
        }
      ],
      "source": [
        "llm.shutdown()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5yZLlWbDCELZ"
      },
      "source": [
        "## 3. Inference using Frontend Structured Generated Language (SGLang)\n",
        "\n",
        " In addition to the HTTP server and the offline backend engine, SGLang also offers a frontend language that supports more customization and complex prompting workflows.\n",
        "\n",
        "In the following sections, you will explore how to start a multi-turn conversation with Gemma 2 using SGLang's frontend language. You will also see how to obtain responses from Gemma 2 in JSON format."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NQaxIIlESnaG"
      },
      "source": [
        "### Launch a server\n",
        "\n",
        "First, you must launch a server using SGLang specifying the Hugging Face repo ID of Gemma 2. You can use the function defined in the introductory sections to launch the server."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "ZjUJyB8zTD0K"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<strong style='color: #00008B;'><br><br>                    NOTE: Typically, the server runs in a separate terminal.<br>                    In this notebook, we run the server and notebook code together, so their outputs are combined.<br>                    To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.<br>                    </strong>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "server_process = start_server()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "s43t42ONTIgT"
      },
      "source": [
        "Use the `function` decorator provided by SGLang to define a function that accepts a few questions you want to ask the model as its arguments. The `user` function is used to add the user's question to the conversation. The `sglang.gen` function is used to generate a response from the model, which is in turn appended to the conversation using the `assistant` function.\n",
        "\n",
        "The function prompts the model with `question_1` and then in turn prompts it with `question_2`. The model is expected to answer `question_2` based on the history of the conversation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "id": "PWyrsLLEkTfl"
      },
      "outputs": [],
      "source": [
        "from sglang import function, user, assistant, gen, set_default_backend, RuntimeEndpoint\n",
        "\n",
        "@function\n",
        "def multi_turn_question(s, question_1, question_2):\n",
        "    s += user(question_1)\n",
        "    s += assistant(gen(\"answer_1\", max_tokens=128))\n",
        "    s += user(question_2)\n",
        "    s += assistant(gen(\"answer_2\", max_tokens=128))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "axiXF9-7WzL1"
      },
      "source": [
        "### Connect to the server"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hSexW_7sWOu8"
      },
      "source": [
        "Connect to the server using `sglang.set_default_backend` by specifying its URL."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {
        "id": "5zjwlk8cD3xU"
      },
      "outputs": [],
      "source": [
        "set_default_backend(RuntimeEndpoint(\"http://localhost:9000\"))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f8OqAkTAc0TB"
      },
      "source": [
        "### Send multi-turn questions to Gemma 2\n",
        "\n",
        "Now, you can run the previously defined `multi_turn_question` function to generate responses from the model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "id": "_0gjddhcWqrq"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "user : Who are the first humans to land on the moon?\n",
            "assistant : The first humans to ever land on the moon were a team from the **Apollo 11 mission**:\n",
            "\n",
            "* **Neil Armstrong**:  He became the first person to walk on the moon. His famous quote, \"One small step for man, one giant leap for mankind,\" encapsulates the magnitude of thishistoric event.\n",
            "* **Buzz Aldrin**: Aldrin was the second human to walk on the moon and stayed with Armstrong for several hours on the lunar surface. \n",
            "\n",
            "They landed on the moon on **July 20, 1969**, bringing back a wealth of lunar samples and photos that remain highly significant\n",
            "user : Which country did they belong to?\n",
            "assistant : The first people to land on the moon were part of **the United States**, often simply referred to as Americans.  They were a team from NASA, the National Aeronautics and Space Administration, the US government's space program. \n",
            "\n"
          ]
        }
      ],
      "source": [
        "state = multi_turn_question.run(\n",
        "    question_1=\"Who are the first humans to land on the moon?\",\n",
        "    question_2=\"Which country did they belong to?\",\n",
        ")\n",
        "\n",
        "for m in state.messages():\n",
        "  print(m[\"role\"], \":\", m[\"content\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MPddl6aldM-J"
      },
      "source": [
        "Notice how the history of the conversation is preserved, and the model answered the second question as a continuation of the conversation."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3x5c2h1jdbWJ"
      },
      "source": [
        "### Run a batch of multi-turn questions\n",
        "\n",
        "You can also batch a set of multi turn questions to the model by passing a list of dictionaries to `run_batch` whose keys specify the arguments to the `multi_turn_question` function."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "id": "rJ2OyNYo7PtM"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "===============================================================\n",
            "\n",
            "user : Who are the first humans to land on moon?\n",
            "assistant : The first humans to land on the Moon were **Neil Armstrong** and **Buzz Aldrin** of the Apollo 11 mission on **July 20, 1969.** \n",
            "\n",
            "user : Which country did they belong to ?\n",
            "assistant : Neil Armstrong and Buzz Aldrin were from the **United States**. \n",
            "\n",
            "\n",
            "===============================================================\n",
            "\n",
            "user : Who is the first human to reach space?\n",
            "assistant : The first human to reach space was **Yuri Gagarin**. \n",
            "\n",
            "On April 12, 1961, he completed one orbit of Earth in the Soviet Vostok 1 spacecraft. This event marked a significant moment in the history of human exploration, paving the way for further advancements and spaceflight feats. \n",
            "\n",
            "user : Which country did they belong to?\n",
            "assistant : Yuri Gagarin was from **Soviet Union** at the time. \n",
            "\n"
          ]
        }
      ],
      "source": [
        "states = multi_turn_question.run_batch(\n",
        "    [\n",
        "        {\n",
        "                \"question_1\": \"Who are the first humans to land on moon?\",\n",
        "                \"question_2\": \"Which country did they belong to ?\",\n",
        "            },\n",
        "        {\n",
        "                \"question_1\": \"Who is the first human to reach space?\",\n",
        "                \"question_2\": \"Which country did they belong to?\",\n",
        "        },\n",
        "    ]\n",
        ")\n",
        "\n",
        "for state in states:\n",
        "  print(\"\\n===============================================================\\n\")\n",
        "  for message in state.messages():\n",
        "    print(message[\"role\"], \":\", message[\"content\"])\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HGsAknP7ZARH"
      },
      "source": [
        "### JSON Decoding\n",
        "\n",
        "You can use a regular expression (regex) to specify a JSON schema that the model's generated answer must adhere to.\n",
        "\n",
        "Define a function to generate specific information about any animal in JSON format using Gemma 2. Specify the regex JSON schema in the regex argument of the sglang.gen function."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {
        "id": "xinEZ7x3dkl4"
      },
      "outputs": [],
      "source": [
        "character_regex = (\n",
        "    r\"\"\"\\{\\n\"\"\"\n",
        "    + r\"\"\"    \"name\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
        "    + r\"\"\"    \"type\": \"(Mammals|Birds|Fish|Reptiles|Amphibians|Invertebrates)\",\\n\"\"\"\n",
        "    + r\"\"\"    \"reproduction\": \"(Sexual|Asexual)\",\\n\"\"\"\n",
        "    + r\"\"\"    \"life expectancy\": \"[0-9]{1,2}\",\\n\"\"\"\n",
        "    + r\"\"\"\\}\"\"\"\n",
        ")\n",
        "\n",
        "@function\n",
        "def animal_gen(s, name):\n",
        "    s += name + \" is an animal. Please fill in the following information about this animal.\\n\"\n",
        "    s += gen(\"json_output\", max_tokens=256, regex=character_regex)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8-NHP4_ffQ3m"
      },
      "source": [
        "Run the function with the name of any animal as input to get its features in JSON format."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {
        "id": "v4NwLmms4_Jp"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Fish is an animal. Please fill in the following information about this animal.\n",
            "{\n",
            "    \"name\": \"Fish\",\n",
            "    \"type\": \"Mammals\",\n",
            "    \"reproduction\": \"Sexual\",\n",
            "    \"life expectancy\": \"10\",\n",
            "}\n"
          ]
        }
      ],
      "source": [
        "state = animal_gen.run(name=\"Fish\")\n",
        "print(state.text())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qhl8qCPYRTSW"
      },
      "source": [
        "Terminate the server process."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sL8n9mBtRLXs"
      },
      "outputs": [],
      "source": [
        "terminate_process(server_process)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vlRAC8OFa1pQ"
      },
      "source": [
        "These are just a few examples of how a prompting workflow with Gemma 2 can be designed using SGLang's frontend language. To learn more about its capabilities, you can refer to SGLang's [Frontend: Structured Generation Language (SGLang)](https://sgl-project.github.io/frontend/frontend.html) guide."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NKR7JKQR5s9U"
      },
      "source": [
        "Congratulations! You've successfully explored how Gemma 2 can be served using SGLang, run using the SGLang backend runtime and frontend language in a Colab environment. You can now experiment with more complex prompting workflows in SGLang to interact with Gemma 2."
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "name": "[Gemma_2]Using_with_SGLang.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
